PS 818 - Statistical Models
September 8, 2025
\[ \require{cancel} \DeclareMathOperator*{\argmin}{arg\,min} \]
\[ \DeclareMathOperator*{\argmax}{arg\,max} \]
\[m(x) = x^\prime\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dotsc\]
\[\beta = \argmin_b \ \mathbb{E}[(Y_i - X_i^\prime b)^2]\]
\[\beta = (\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\]
\[m(X_i) = X_i^\prime\beta = X_i^\prime (\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\]
\[e_i = Y_i - m(X_i)\]
\[Y_i = X_i^\prime\beta + e_i\]
\[\begin{align*} \mathbb{E}[X_i e_i] &= \mathbb{E}[X_i(Y_i - X_i^\prime\beta)]\\ &= \mathbb{E}[X_iY_i] - \mathbb{E}[X_iX_i^\prime]\beta\\ &= \mathbb{E}[X_iY_i] - \mathbb{E}[X_iX_i^\prime](\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\\ &= \mathbb{E}[X_iY_i] - \mathbb{E}[X_iY_i] = 0\\ \end{align*}\]
\[\begin{align*} Cov(X_i, e_i) = \mathbb{E}[X_ie_i] - \mathbb{E}[X_i]\mathbb{E}[e_i] = 0 - 0 = 0 \end{align*}\]
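The orthogonality of the regressors and the projection error has an exact sample analogue: OLS residuals are orthogonal to every column of the design matrix. A quick numerical check (an illustrative Python/NumPy sketch with simulated data, not the course's R code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Design matrix: intercept plus two regressors
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(size=n)

# OLS coefficients via the normal equations, then residuals
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e_hat = Y - X @ beta_hat

# Sample analogue of E[X_i e_i]: zero up to floating point
print(X.T @ e_hat / n)
```

The orthogonality holds mechanically in the sample, not just in expectation, because β̂ solves the sample normal equations.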
Remember, \(m(X)\) is a population quantity - can we come up with an estimator in our sample that is consistent for it?
Our estimand
\[\beta = (\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\]
Our estimator
\[\hat{\beta} = \bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{n} \sum_{i=1}^n X_iY_i\bigg)\]
Often written as:
\[\hat{\beta} = (\mathbf{X}^\prime\mathbf{X})^{-1}(\mathbf{X}^\prime Y)\]
\[\hat{\beta} = \beta + \bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{n} \sum_{i=1}^n X_ie_i\bigg) \]
We can use the weak law of large numbers
\[\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime \overset{p}{\to} \mathbb{E}[X_iX_i^\prime]\]
\[\frac{1}{n} \sum_{i=1}^n X_ie_i \overset{p}{\to} \mathbb{E}[X_ie_i] = 0\]
Plugging it all in, we have
\[\hat{\beta} \overset{p}{\to} \beta\]
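Consistency is easy to see in simulation: as \(n\) grows, the estimation error shrinks toward zero. A minimal Python sketch (sample sizes and coefficients are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
beta_true = np.array([1.0, 2.0])

def ols(n):
    # One draw of beta-hat at sample size n
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ beta_true + rng.normal(size=n)
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Largest absolute estimation error at each sample size
errors = {n: np.abs(ols(n) - beta_true).max() for n in (100, 10_000, 1_000_000)}
print(errors)
```

The error is on the order of \(1/\sqrt{n}\), consistent with the asymptotic normality result that follows.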
Ordinary least squares is consistent for the best linear predictor under relatively mild assumptions
What have we not assumed?
\[\sqrt{n}(\hat{\beta} - \beta) = \bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{\sqrt{n}} \sum_{i=1}^n X_ie_i\bigg)\]
We can apply the CLT to the other term. We know the expectation is zero, what about the variance?
\[Var(X_ie_i) = \mathbb{E}[X_i e_i (X_i e_i)^\prime] = \mathbb{E}[e_i^2X_i X_i^\prime]\]
By the CLT, we have
\[\frac{1}{\sqrt{n}} \sum_{i=1}^n X_ie_i \overset{d}{\to} \mathcal{N}(0, \mathbb{E}[e_i^2X_i X_i^\prime])\]
Combining with our other convergence result using Slutsky’s theorem gives
\[\sqrt{n}(\hat{\beta} - \beta) \overset{d}{\to} \mathcal{N}\bigg(0, (\mathbb{E}[X_iX_i^\prime])^{-1} (\mathbb{E}[e_i^2X_i X_i^\prime]) (\mathbb{E}[X_iX_i^\prime])^{-1}\bigg)\]
Again, we have a bunch of population expectations, let’s plug in their sample equivalents!
Our estimand
\[Var(\hat{\beta}) = \frac{1}{n} (\mathbb{E}[X_iX_i^\prime])^{-1} (\mathbb{E}[e_i^2X_i X_i^\prime]) (\mathbb{E}[X_iX_i^\prime])^{-1}\]
Our estimator
\[\widehat{Var(\hat{\beta})} = \frac{1}{n} \bigg(\frac{1}{n} \sum_{i=1}^n X_i X_i^\prime \bigg)^{-1}\bigg(\frac{1}{n} \sum_{i=1}^n \hat{e}_i^2 X_i X_i^\prime \bigg)\bigg(\frac{1}{n} \sum_{i=1}^n X_i X_i^\prime \bigg)^{-1}\]
\[\widehat{Var(\hat{\beta})} = (\mathbf{X}^\prime\mathbf{X})^{-1} (\mathbf{X}^\prime \hat{\Sigma} \mathbf{X}) (\mathbf{X}^\prime\mathbf{X})^{-1}\] where \(\hat{\Sigma} = \text{diag}(\hat{e}_1^2, \hat{e}_2^2, \dotsc, \hat{e}_n^2)\)
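This "sandwich" estimator (HC0 form) is straightforward to compute directly. A Python/NumPy sketch with deliberately heteroskedastic simulated errors (the variance pattern is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# Heteroskedastic errors: variance grows with the regressor
e = rng.normal(size=n) * (1 + np.abs(X[:, 1]))
Y = X @ np.array([1.0, 2.0]) + e

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e_hat = Y - X @ beta_hat

# HC0 sandwich: (X'X)^{-1} (X' diag(e_hat^2) X) (X'X)^{-1}
XtX_inv = np.linalg.inv(X.T @ X)
meat = (X * e_hat[:, None] ** 2).T @ X
V_robust = XtX_inv @ meat @ XtX_inv
robust_se = np.sqrt(np.diag(V_robust))
print(robust_se)
```

Scaling each row of \(\mathbf{X}\) by \(\hat{e}_i^2\) before the inner product avoids ever forming the \(n \times n\) matrix \(\hat{\Sigma}\) explicitly.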
Call:
lm(formula = voteshare ~ pctWhiteNH, data = election)
Residuals:
Min 1Q Median 3Q Max
-35.82 -8.31 -0.85 7.20 45.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.8083 0.9059 74.9 <2e-16 ***
pctWhiteNH -0.4617 0.0112 -41.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.2 on 3112 degrees of freedom
Multiple R-squared: 0.354, Adjusted R-squared: 0.353
F-statistic: 1.7e+03 on 1 and 3112 DF, p-value: <2e-16
In R, lm_robust in the estimatr package computes heteroskedasticity-robust standard errors of this sandwich form.
We’ve shown that the OLS estimator is consistent and asymptotically normal for the Best Linear Predictor
To do inference on the CEF, we assume that it’s equal to the BLP!
Assumption 1: Linearity
\[Y_i = X_i^\prime\beta + \epsilon_i\]
Assumption 2: Strict exogeneity of the errors
\[\mathbb{E}[\epsilon | \mathbf{X}] = 0\]
What does this get us?
When is linearity always true?
\[\begin{align*}\hat{\beta} &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}Y)\\ &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}(\mathbf{X}\beta + \epsilon))\\ &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\mathbf{X})\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon)\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \end{align*}\]
\[\begin{align*} \mathbb{E}[\hat{\beta} | \mathbf{X}] &= \mathbb{E}\bigg[\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \bigg| \mathbf{X} \bigg]\\ &= \mathbb{E}[\beta | \mathbf{X}] + \mathbb{E}[(\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) | \mathbf{X}]\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime} \mathbb{E}[\epsilon | \mathbf{X}]\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}0\\ &= \beta \end{align*}\]
Lastly, by law of total expectation
\[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\mathbb{E}[\hat{\beta}|\mathbf{X}]]\]
Therefore
\[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\beta] = \beta\]
Note that the assumptions here are slightly stronger than what we needed for consistency for the BLP
But what have we not assumed?
Assumption 4 - Spherical errors
\[Var(\epsilon | \mathbf{X}) = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0\\ 0 & \sigma^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} = \sigma^2 \mathbf{I}\]
Benefits
Drawbacks
Under spherical errors, the variance formula simplifies
\[Var(\hat{\beta} | \mathbf{X}) = (\mathbf{X}^\prime\mathbf{X})^{-1} (\mathbf{X}^\prime \sigma^2\mathbf{X}) (\mathbf{X}^\prime\mathbf{X})^{-1}\] \[Var(\hat{\beta} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\prime\mathbf{X})^{-1}\]
All we need is an estimator for \(\sigma^2\)
An unbiased estimator for \(\sigma^2\) is the sample variance of the residuals (mean squared error) adjusted by a degrees of freedom correction
\[\hat{\sigma}^2 = \frac{1}{n-k} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2\]
\(k\) is the number of parameters. We refer to \(n-k\) as the “degrees of freedom”
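Putting the pieces together, the classical (homoskedastic) standard errors follow from \(\hat{\sigma}^2\) and \((\mathbf{X}^\prime\mathbf{X})^{-1}\). A Python sketch under a simulated design with true \(\sigma^2 = 1\) (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 1000, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)  # homoskedastic, sigma^2 = 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat

# Unbiased estimator of sigma^2 with the n - k degrees-of-freedom correction
sigma2_hat = resid @ resid / (n - k)
V_classical = sigma2_hat * np.linalg.inv(X.T @ X)
print(sigma2_hat, np.sqrt(np.diag(V_classical)))
```

With the true variance equal to 1, \(\hat{\sigma}^2\) lands close to 1, and the square roots of the diagonal of \(\hat{\sigma}^2 (\mathbf{X}^\prime\mathbf{X})^{-1}\) are the standard errors reported by a classical regression table.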
One way to think about the fitted values from OLS is that they are a projection of the n-dimensional vector \(Y\) into the column space of \(\mathbf{X}\)
We define the projection matrix or hat matrix
\[\mathbf{P}_{\mathbf{X}} = \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X^\prime}\]
Our fitted values \(\hat{Y}\) are a projection of \(Y\) to the “closest” vector that can be represented as a linear combination of the columns of \(\mathbf{X}\)
\[\hat{Y} = \mathbf{P}_{\mathbf{X}}Y = \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X^\prime}Y = \mathbf{X}\hat{\beta}\]
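Two defining properties of the hat matrix are easy to verify numerically: it is idempotent (projecting twice changes nothing), and applying it to \(Y\) reproduces the OLS fitted values. A small Python sketch (dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = rng.normal(size=n)

# Hat matrix P_X = X (X'X)^{-1} X'
P = X @ np.linalg.inv(X.T @ X) @ X.T

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat

# Idempotency (P P = P) and P Y = fitted values
print(np.allclose(P @ P, P), np.allclose(P @ Y, Y_hat))
```

As a side note, the trace of \(\mathbf{P}_{\mathbf{X}}\) equals the number of columns of \(\mathbf{X}\), which is one way to motivate the degrees-of-freedom correction above.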
Proof: Suppose there’s another linear estimator whose weights differ from the OLS weights by a non-zero \(k \times n\) matrix \(D\)
\[\tilde{\beta} = ((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D)Y\]
Let’s take the expectation
\[\mathbb{E}[\tilde{\beta}] = \mathbb{E}[\hat{\beta}_{\text{OLS}}] + \mathbb{E}[DY]\]
We know OLS is unbiased. Next, we plug in the expression for \(Y\) (linearity)
\[\mathbb{E}[\tilde{\beta}] = \beta + \mathbb{E}[D\mathbf{X}\beta] + \mathbb{E}[D\epsilon]\]
Pull out the constants and use \(\mathbb{E}[\epsilon] = 0\)
\[\mathbb{E}[\tilde{\beta}] = (\mathbf{I} + D\mathbf{X})\beta\]
If \(\tilde{\beta}\) is unbiased, then
\[(\mathbf{I} + D\mathbf{X})\beta = \beta\]
This holds for all values of \(\beta\) only if \(D\mathbf{X} = 0\)
Under homoskedasticity, the variance is
\[Var(\tilde{\beta}) = \sigma^2\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D\bigg)\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D\bigg)^{\prime}\]
Using properties of matrix transposes
\[Var(\tilde{\beta}) = \sigma^2\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D\bigg)\bigg(\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1} + D^\prime\bigg)\]
Expanding
\[Var(\tilde{\beta}) = \sigma^2\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1} + (\mathbf{X}^\prime\mathbf{X})^{-1}(D\mathbf{X})^\prime + D\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1} + DD^\prime\bigg)\]
Using \(D\mathbf{X} = 0\) from above along with the definition of the variance of \(\hat{\beta}_{\text{OLS}}\)
\[Var(\tilde{\beta}) = Var(\hat{\beta}_{\text{OLS}}) + \sigma^2DD^\prime\]
The variance of our alternative linear, unbiased estimator exceeds the variance of OLS by the positive semi-definite matrix \(\sigma^2DD^\prime\)
Therefore OLS is the lowest-variance linear unbiased estimator
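The Gauss-Markov argument can be checked by Monte Carlo: construct a competitor with \(D\mathbf{X} = 0\) (so it is unbiased) and compare the simulated variances. A Python sketch, with an arbitrary scale for \(D\):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# Build D with D X = 0: its rows live in the orthogonal complement of col(X)
P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P
D = 0.05 * (rng.normal(size=(2, n)) @ M)

A_ols = np.linalg.inv(X.T @ X) @ X.T   # OLS weights
A_alt = A_ols + D                       # competing linear unbiased weights

# Monte Carlo over homoskedastic errors: both unbiased, OLS has lower variance
draws_ols, draws_alt = [], []
for _ in range(5000):
    Y = X @ beta + rng.normal(size=n)
    draws_ols.append(A_ols @ Y)
    draws_alt.append(A_alt @ Y)
var_ols = np.var(np.array(draws_ols)[:, 1])
var_alt = np.var(np.array(draws_alt)[:, 1])
print(var_ols, var_alt)
```

The gap between the two simulated variances approximates \(\sigma^2 (DD^\prime)_{22}\), the excess-variance term from the proof.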
\[\epsilon | \mathbf{X} \sim \mathcal{N}(0, \sigma^2\mathbf{I})\]
It can be difficult to understand the mechanics of OLS with many regressors.
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon\]
\[\hat{\beta}_1 = \frac{\widehat{Cov}(Y_i, X_i)}{\widehat{Var}(X_i)}\]
Can we get something like this in the multivariate case?
\[Y = \mathbf{X}\beta + \epsilon\]
With multiple predictors, we can write the regression coefficient on any of our regressors in terms of partial regressions.
\[Y = \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 + \epsilon\]
The Frisch-Waugh-Lovell theorem states that the OLS estimator \(\hat{\beta}_1\) can be recovered by: (1) regressing \(Y\) on \(\mathbf{X}_2\) and saving the residuals, (2) regressing each column of \(\mathbf{X}_1\) on \(\mathbf{X}_2\) and saving the residuals, and (3) regressing the first set of residuals on the second.
voteshare_means <- election %>% group_by(state) %>% summarize(voteshare_state = mean(voteshare))
pctWhiteNH_means <- election %>% group_by(state) %>% summarize(pctWhiteNH_state = mean(pctWhiteNH))
election <- election %>% left_join(voteshare_means, by="state")
election <- election %>% left_join(pctWhiteNH_means, by="state")
election$demeaned_y <- election$voteshare - election$voteshare_state
election$demeaned_x <- election$pctWhiteNH - election$pctWhiteNH_state
tidy(lm(demeaned_y ~ demeaned_x, data=election)) %>% filter(term == "demeaned_x")
# A tibble: 1 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 demeaned_x -0.650 0.0105 -62.0 0
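The same partialling-out logic can be verified outside the election data. A Python/NumPy sketch with simulated, correlated regressors (all values illustrative): the coefficient on \(x_1\) from the full regression exactly matches the residual-on-residual regression.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n) + 0.5 * x1          # correlated with x1
Y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

ones = np.ones(n)
# Full regression: coefficient on x1
b_full = ols(np.column_stack([ones, x1, x2]), Y)[1]

# FWL: residualize Y and x1 on (1, x2), then regress residual on residual
X2 = np.column_stack([ones, x2])
rY = Y - X2 @ ols(X2, Y)
r1 = x1 - X2 @ ols(X2, x1)
b_fwl = (r1 @ rY) / (r1 @ r1)

print(b_full, b_fwl)
```

The equality is algebraic, not approximate; the group-demeaning in the R example above is the special case where \(\mathbf{X}_2\) is a set of group (state) indicators.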
It is common to see the linear model written in its fully parametric form.
Stochastic:
\[Y_i \sim \text{Normal}(\mu_i, \sigma^2)\]
Systematic:
\[\mu_i = X_i^{\prime}\beta\]
What’s assumed to be known?
What’s assumed to have a particular distribution?
We are interested in estimating and conducting inference on the parameters: \(\beta\) and (less importantly) \(\sigma^2\).
We can specify a broad set of models for \(Y_i\) using this framework
Stochastic
\[Y_i \sim f(\theta_i, \alpha)\]
Systematic
\[\theta_i = g(X_i, \beta)\]
What are these quantities?
We will spend some time with a particular class of models called “Generalized Linear Models” where the systematic component has the form
\[\theta_i = g(X_i^{\prime}\beta)\]
Continuous on an unbounded support \((-\infty, \infty)\)
Two parameters: Mean \(\mu\) and Variance \(\sigma^2\)
Probability Density Function (PDF)
\[f_N(x;\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{- \frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2 \right\}\]
Discrete, defined on the support of the nonnegative integers \(\{0, 1, 2, \dotsc\}\)
Single parameter \(\lambda\), which is both the mean and the variance
Probability Mass Function (PMF)
\[f(x; \lambda) = \frac{\lambda^x\exp\{-\lambda\}}{x!}\]
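As a sanity check on this PMF, we can confirm numerically that it sums to one and that its mean and variance both equal \(\lambda\). A small Python sketch (the value of \(\lambda\) and the truncation point are arbitrary choices):

```python
import math

lam = 3.5

def poisson_pmf(x, lam):
    # f(x; lambda) = lambda^x exp(-lambda) / x!
    return lam ** x * math.exp(-lam) / math.factorial(x)

# Truncating the support at 100 is enough: the tail mass is negligible for lam = 3.5
support = range(101)
total = sum(poisson_pmf(x, lam) for x in support)
mean = sum(x * poisson_pmf(x, lam) for x in support)
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in support)
print(total, mean, var)
```

Both moments coming out equal to \(\lambda\) is the defining feature (and restriction) of the Poisson model.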
The Binomial distribution is discrete with support \(\{0, 1, \dotsc, n\}\) and PMF
\[f(x; p, n) = {n \choose x} p^x(1-p)^{n-x}\]
Consider data \(\mathbf{y} = \{y_1, y_2, \dotsc, y_n\}\) which we assume comes from a known distribution with density \(f(\mathbf{y} | \theta)\) with unknown parameter \(\theta\)
The likelihood function is a function of \(\theta\) that is evaluated at the observed values of \(\mathbf{y}\)
\[\mathcal{L}(\theta| \mathbf{y}) = f(\mathbf{y} | \theta)\]
Given some input \(\theta\), the likelihood function returns the density of the observed data evaluated at that value.
The likelihood function is not a probability density
The likelihood function is not \(f(\theta | \mathbf{y})\)
Even when we (later) move to a Bayesian framework \(f(\theta | \mathbf{y}) \neq f(\mathbf{y} | \theta)\)
Remember Bayes’ Rule:
\[f(\theta | \mathbf{y}) = \frac{f(\mathbf{y} | \theta)}{f(\mathbf{y})} \times f(\theta)\]
Or alternatively, since \(f(\mathbf{y})\) is just a normalizing constant, you’ll see this written as:
\[\underbrace{f(\theta | \mathbf{y})}_{\text{posterior}} \propto \underbrace{f(\mathbf{y}|\theta)}_{\text{likelihood}} \times \underbrace{f(\theta)}_{\text{prior}}\]
We often make the assumption that our data are independently and identically distributed
This allows us to factor the likelihood
\[\mathcal{L}(\theta|\mathbf{y}) = f(\mathbf{y} | \theta) = \prod_{i=1}^n f(y_i | \theta)\]
Since our eventual goal will be to find an optimum of the likelihood, we can apply any monotonic function to it since that preserves maxima/minima.
The log-likelihood is typically denoted \(\ell(\theta| \mathbf{y})\)
\[\ell(\theta|\mathbf{y}) = \log f(\mathbf{y}|\theta) = \sum_{i=1}^n \log f(y_i | \theta)\]
\[\hat{\theta} = \argmax_\theta \ \log f(\mathbf{y}|\theta)\]
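The Poisson case makes this concrete: maximizing the log-likelihood over \(\lambda\) recovers the sample mean, which is the known closed-form MLE. A Python sketch using a hard-coded sample and a simple grid search (the data and grid are illustrative):

```python
import math

# A small sample of counts, hard-coded for reproducibility
y = [2, 4, 3, 1, 5, 3, 2, 4, 3, 3]

def loglik(lam, y):
    # sum_i log f(y_i | lambda) for the Poisson PMF
    return sum(yi * math.log(lam) - lam - math.log(math.factorial(yi)) for yi in y)

# Grid search for the maximizer of the log-likelihood
grid = [0.01 * k for k in range(1, 1001)]
lam_hat = max(grid, key=lambda lam: loglik(lam, y))

# Closed form: the Poisson MLE is the sample mean
print(lam_hat, sum(y) / len(y))
```

Setting the score \(\sum_i (y_i/\lambda - 1)\) to zero gives \(\hat{\lambda} = \bar{y}\) analytically, which the grid search reproduces.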
Before illustrating consistency, we need to define some additional functions of the log-likelihood.
The first is the score or the gradient of \(\ell(\theta|\mathbf{y})\) with respect to \(\theta\)
\[\mathbf{S} = \frac{\partial}{\partial\theta} \log f(\mathbf{y}|\theta)\]
With i.i.d. data, you’ll often see the score written in terms of the score for an individual observation \(i\)
\[\mathbf{S}_i = \frac{\partial}{\partial\theta} \log f(y_i|\theta)\]
At the value of the true parameter \(\theta_0\), the expected value of the score is \(0\) (under some regularity conditions)
\[\mathbb{E}\bigg[\frac{\partial}{\partial\theta} \log f(y_i|\theta)\bigg]\bigg|_{\theta = \theta_0} = \frac{\partial}{\partial\theta} \mathbb{E}[\log f(y_i|\theta)]\bigg|_{\theta = \theta_0} = 0\]
With this definition of the score, we can show consistency of the MLE under i.i.d. observations.
By the weak law of large numbers
\[\frac{1}{n} \sum_{i=1}^n \log f(y_i | \theta) \overset{p}{\to} \mathbb{E}[\log f(y_i | \theta)]\]
The MLE is an optimizer of the left-hand side.
And we just showed that the score (the derivative of the right-hand side) is \(0\) at the true value of \(\theta\), denoted \(\theta_0\), so \(\theta_0\) optimizes the limiting objective
Therefore the MLE \(\hat{\theta}\) is consistent for the true value \(\theta_0\)
There are two important conditions for consistency that can be violated in practice
Identifiability
\[f(\mathbf{y}|\theta) \neq f(\mathbf{y} | \theta_0) \ \forall \ \theta \neq \theta_0\]
Fixed parameter space
PS 818 - University of Wisconsin - Madison